# Checking for legislative text reuse using Solr and ngrams

In reproducing [this piece on model legislation](https://www.usatoday.com/pages/interactives/asbestos-sharia-law-model-bills-lobbyists-special-interests-influence-state-laws/), we need a way to compare model legislation (bills written by lobbyists) with actual legislation that was proposed or passed. To do this, we'll narrow down our pool of potential matches using a simple Solr text search, then run a scikit-learn-based comparison on the bills that remain.


### Prep work: Downloading necessary files
Before we get started, we need to download all of the data we'll be using.
* **alec-model-policies.csv:** model legislation scraped from the American Legislative Exchange Council (ALEC)


In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/azcentral-text-reuse-model-legislation/data/alec-model-policies.csv -P data

In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_colwidth", 3000)

## What are we doing here?

## Read in model bills

We start by reading in a list of model bills scraped from the [American Legislative Exchange Council](https://www.alec.org/), a leading source of model legislation. In this notebook we're going to look for legislation based on just one of these bills.

In [2]:
df = pd.read_csv("data/alec-model-policies.csv")
df.head(2)

Unnamed: 0,title,url,content
0,Resolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA),https://www.alec.org/model-policy/resolution-supporting-congressional-approval-of-the-united-states-mexico-canada-agreement-usmca/,"\n\nDraft\nResolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the imposition of artificial barriers to free and open trade are harmful to American economic interests; and\nWhereas, together, the United States, Canada and Mexico promote a shared belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, a longstanding, close tri-lateral relationship, codified in the North American Free Trade Agreement (NAFTA), has existed between the United States, Canada, and Mexico for more than 25 years and has proven economically, culturally and strategically important for all parties and this relationship will continue with ratification of USMCA; and\nWhereas, trade with Canada and Mexico supports nearly 12 million American jobs, and nearly 5 million of those jobs are supported by increased trade generated by NAFTA and these benefits will continue with ratification of USMCA; and\nWhereas, since NAFTA entered into force in 1994, trade with Canada and Mexico has nearly quadrupled to $1.3 trillion, and the two countries buy more than one-third of U.S. merchandise exports; and\nWhereas, for 43 states in the United States, Canada and Mexico represent their first or second largest export market and all but one U.S. 
state count Canada or Mexico as a top three trading partner; and\nWhereas, Canada and Mexico are the two largest trading partners for [INSERT STATE] with [INSERT PERCENTAGE AVAILABLE ON USTR WEBSITE] percent of the state’s goods exports going to Canada and another [INSERT APPROPRIATE PERCENTAGE AVAILABLE ON USTR WEBSITE] percent going to Mexico; and\nWhereas, NAFTA has contributed to a 405% increase in U.S. agricultural exports to Canada and Mexico; and\nWhereas, the modernized USMCA may prove even more beneficial to the agricultural sector than NAFTA and will offer a higher degree of certainty and stability to farmers; and\nWhereas, U.S. service exports to Canada and Mexico have tripled, rising from $27.5 billion in 1993 to $91.3 billion in 2017, thanks to new market access and clearer rules afforded by NAFTA which will be continued under USMCA; and\nWhereas, Canada and Mexico are the top two export destinations for U.S. small and medium-sized enterprises, more than 125,000 of which sold their goods and services in Canada and Mexico in 2014; now\nWhereas, trade among our North American trading partners is made up predominantly of intellectual property (IP)-intensive goods and services that employ millions of Americans in high paying jobs and generate billions of dollars in economic output; and\nWhereas, many of the IP-intensive goods, services and exchanges through which trade is facilitated in the NAFTA bloc did not exist when the agreement was drafted and this situation has resulted in uneven and weak IP enforcement; and\nWhereas, stringent enforcement of IP rights has been found to correlate c..."
1,Resolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA),https://www.alec.org/model-policy/draft-resolution-supporting-the-intellectual-property-ip-provisions-in-the-united-states-mexico-canada-agreement-usmca/,"\n\nDraft\nResolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the American Legislative Exchange Council (ALEC) policy on free trade acknowledges that, “the imposition of artificial barriers to free and open trade…are deterrents to American economic interests;” and\nWhereas, the United States, Canada and Mexico share a belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, trade among our North American trading partners is made up predominantly of intellectual property (IP)-intensive goods and services that employ millions of Americans in high paying jobs and generate billions of dollars in economic output; and\nWhereas, many of the IP-intensive goods, services and exchanges through which trade is facilitated in the NAFTA bloc did not exist when the agreement was drafted and this situation has resulted in uneven and weak IP enforcement; and\nWhereas, trade agreements are the most appropriate mechanism to harmonize and strengthen IP rights protections ensuring domestic and foreign business are on the same equal footing before the law; and\nWhereas, stringent enforcement of IP rights has been found to correlate closely with greater household income, Foreign Direct Investment, and Gross Domestic Product; and\nWhereas, the IP provisions found in the USMCA are the most comprehensive of any multilateral U.S. 
trade agreement and are vastly superior to those included in NAFTA;\nTherefore be it resolved, that ALEC applauds the intellectual property provisions in the United States- Mexico-Canada Agreement; and\nBe it further resolved, that ALEC urges the President of the United States to retain NAFTA until USMCA is implemented to ensure continuity in trade among the three North American economic partners; and\nBe it further resolved, that upon adoption, an official copy of this Resolution be prepared and presented to the President of the United States, to the Chairmen and Ranking members, and all other members of the U.S. Senate Finance and the U.S. House Ways and Means Committees, to the members of the Senate and House Advisory Groups on Negotiations, to the U.S. Trade Representative, to the U.S. Secretaries of Commerce, State, and Labor, to the Director of the Office of Management and Budget and to the Intellectual Property Enforcement Coordinator.\n \n \n\n"


## Pick the model bill we're interested in

In this notebook we're only looking at a **single piece** of model legislation. We'll pick one arbitrarily, the bill at row 200:

In [24]:
target = df.loc[200]
target

title                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

In [36]:
len(target.content)

6832

## Search for content like that bill on Solr

The text analysis we have planned is too expensive to run across all 1.2 million documents. Instead, we're going to use Solr to pare down the candidates, then perform our text analysis on that subset.

First, we'll see which bills are **kind of similar** according to Solr. We'll do this by adding the model legislation to the index and asking for ["more like this"](https://lucene.apache.org/solr/guide/8_2/morelikethis.html). Once we have a list of similar bills, we'll delete the model legislation from Solr and perform similarity measurements on the model legislation and the similar bills.

In [25]:
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/legislation', always_commit=True)

# Delete previous samples if they're still hanging around
solr.delete(q='bill_id:0')

'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n\n<lst name="responseHeader">\n  <int name="status">0</int>\n  <int name="QTime">605</int>\n</lst>\n</response>\n'

In [26]:
# Add the model legislation
solr.add([{ 'content': target.content, 'bill_id': 0 }])

'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n\n<lst name="responseHeader">\n  <int name="status">0</int>\n  <int name="QTime">229</int>\n</lst>\n</response>\n'

In [27]:
import requests

response = requests.get('http://localhost:8983/solr/legislation/mlt?q=bill_id:0&rows=200&fl=bill_id,score')
data = response.json()
data

{'responseHeader': {'status': 0, 'QTime': 1219},
 'match': {'numFound': 1,
  'start': 0,
  'maxScore': 1.0,
  'docs': [{'bill_id': 0, 'score': 1.0}]},
 'response': {'numFound': 933318,
  'start': 0,
  'maxScore': 36.04369,
  'docs': [{'bill_id': 980612, 'score': 36.04369},
   {'bill_id': 495403, 'score': 34.73769},
   {'bill_id': 676985, 'score': 34.719498},
   {'bill_id': 495405, 'score': 34.406055},
   {'bill_id': 1051344, 'score': 33.967087},
   {'bill_id': 1039050, 'score': 33.23528},
   {'bill_id': 733408, 'score': 33.159477},
   {'bill_id': 749353, 'score': 32.85357},
   {'bill_id': 453419, 'score': 32.46553},
   {'bill_id': 890722, 'score': 32.41979},
   {'bill_id': 453394, 'score': 32.107853},
   {'bill_id': 495401, 'score': 31.921883},
   {'bill_id': 50474, 'score': 31.81652},
   {'bill_id': 625605, 'score': 31.787214},
   {'bill_id': 677835, 'score': 31.761711},
   {'bill_id': 1099535, 'score': 31.506304},
   {'bill_id': 700217, 'score': 31.44896},
   {'bill_id': 851195, 'sco

In [28]:
morelikethis = pd.DataFrame(data['response']['docs'])
morelikethis.head()

Unnamed: 0,bill_id,score
0,980612,36.04369
1,495403,34.73769
2,676985,34.719498
3,495405,34.406055
4,1051344,33.967087
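
The add, query, delete round trip described above can be wrapped in one helper so the temporary document is removed even if the query fails. This is only a sketch under assumptions, not the notebook's own code: `mlt_url`, `find_similar`, and the `fetch_json` argument are hypothetical names, and the client just needs pysolr-style `add` and `delete` methods.

```python
from urllib.parse import urlencode

def mlt_url(base_url, bill_id, rows=200):
    """Build the MoreLikeThis URL used above, letting urlencode handle escaping."""
    params = {"q": f"bill_id:{bill_id}", "rows": rows, "fl": "bill_id,score"}
    return f"{base_url}/mlt?{urlencode(params)}"

def find_similar(solr, fetch_json, base_url, content, temp_id=0, rows=200):
    """Index `content` under a temporary id, ask Solr for similar docs, then clean up.

    `solr` is any object with pysolr-style add()/delete() methods.
    `fetch_json` is a hypothetical callable mapping a URL to parsed JSON,
    e.g. lambda url: requests.get(url).json().
    """
    solr.delete(q=f"bill_id:{temp_id}")  # clear leftovers from earlier runs
    solr.add([{"content": content, "bill_id": temp_id}])
    try:
        data = fetch_json(mlt_url(base_url, temp_id, rows))
        return data["response"]["docs"]
    finally:
        # Always remove the model bill so it never lingers among the real bills
        solr.delete(q=f"bill_id:{temp_id}")
```

With the objects from the cells above, a call might look like `find_similar(solr, lambda url: requests.get(url).json(), 'http://localhost:8983/solr/legislation', target.content)`. The sketch assumes the client commits on every call, as the `always_commit=True` client above does.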


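## Query the database

Solr only gave us bill IDs and scores, so we pull the full text of those bills from the PostgreSQL database.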

In [29]:
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost:5432/legislation')

query = "select * from bills where bill_id = ANY(ARRAY{})".format(list(morelikethis.bill_id))
matches_df = pd.read_sql_query(query, engine)
matches_df.head(2)

Unnamed: 0,id,bill_id,code,bill_number,title,description,state,session,filename,status,status_date,url,error,content,processed_at
0,684842,40145,HB2044,HB2044,Further providing for private actions.,"An Act amending the act of December 17, 1968 (P.L.1224, No.387), known as the Unfair Trade Practices and Consumer Protection Law, further providing for private actions.",PA,2009-2010 Regular Session,bill_data/PA/2009-2010_Regular_Session/bill/HB2044.json,2,2010-09-28,http://www.legis.state.pa.us/cfdocs/legis/PN/Public/btCheck.cfm?txtType=HTM&sessYr=2009&sessInd=0&billBody=H&billTyp=B&billNbr=2044&pn=2812,,"Regular Session 2009-2010 House Bill 2044 P.N. 2812 \n\n\t\n\t \n\n\t\t \n\t \n\tPRINTER'S NO. 2812\n\n\n\n\t\n\t \n\n\t\n\tTHE GENERAL ASSEMBLY OF PENNSYLVANIA\n\n\t\n\t \n\n\t\n\tHOUSE BILL\n\n\t\t \n\tNo.\n\t2044\n\tSession of\n2009\n\n\n\n\t\n\t \n\n\t\n\t \n\n\t\n\tINTRODUCED BY DeLUCA, BELFANTI, BOBACK, CALTAGIRONE, CLYMER, D. COSTA, EVERETT, FRANKEL, HARPER, JOSEPHS, KOTIK, LONGIETTI, MILLER, MURT, MYERS, PHILLIPS, QUINN, SIPTROTH, THOMAS, WALKO AND WATERS, OCTOBER 14, 2009\n\n\t\n\t \n\n\t\n\t \n\n\t\n\tREFERRED TO COMMITTEE ON CONSUMER AFFAIRS, OCTOBER 14, 2009 \n\n\t\n\t \n\n\t\n\t \n\n\t\n\t \n\n\t\n\tAN ACT\n\n\t\n\t \n\n\t1\n\tAmending the act of December 17, 1968 (P.L.1224, No.387), \n\n\t2\n\tentitled ""An act prohibiting unfair methods of competition \n\n\t3\n\tand unfair or deceptive acts or practices in the conduct of \n\n\t4\n\tany trade or commerce, giving the Attorney General and \n\n\t5\n\tDistrict Attorneys certain powers and duties and providing \n\n\t6\n\tpenalties,"" further providing for private actions.\n\n\t7\n\tThe General Assembly of the Commonwealth of Pennsylvania \n\n\t8\n\thereby enacts as follows:\n\n\t9\n\tSection 1. 
Section 9.2 of the act of December 17, 1968 \n\n\t10\n\t(P.L.1224, No.387), known as the Unfair Trade Practices and \n\n\t11\n\tConsumer Protection Law, reenacted and amended November 24, 1976 \n\n\t12\n\t(P.L.1166, No.260), amended December 4, 1996 (P.L.906, No.146) \n\n\t13\n\tand repealed in part April 28, 1978 (P.L.202, No.53), is amended \n\n\t14\n\tto read:\n\n\t15\n\tSection 9.2. Private Actions.--(a) Any person who purchases \n\n\t16\n\tor leases goods or services primarily for personal, family or \n\n\t17\n\thousehold purposes and thereby suffers any ascertainable loss of \n\n\t18\n\tmoney or property, real or personal, as a result of the use or \n\n\t19\n\temployment by any person of a method, act or practice declared \n\n\t\t\n\t\n\t \n\n\t\n\n\n\n\t1\n\tunlawful by section 3 of this act, may bring a private action to \n\n\t2\n\trecover actual damages or [one hundred dollars ($100)] five \n\n\t3\n\thundred dollars ($500), whichever is greater. The court may, in \n\n\t4\n\tits discretion, award up to three times the actual damages \n\n\t5\n\tsustained, but not less than [one hundred dollars ($100)] five \n\n\t6\n\thundred dollars ($500), and may provide such additional relief \n\n\t7\n\tas it deems necessary or proper. The court may award to the \n\n\t8\n\tplaintiff, in addition to other relief provided in this section, \n\n\t9\n\tcosts and reasonable attorney fees.\n\n\t10\n\t(b) Any permanent injunction, judgment or order of the court \n\n\t11\n\tmade under section 4 of this act shall be prima facie evidence \n\n\t12\n\tin an action brought under section 9.2 of this act that the \n\n\t13\n\tdefendant used or employed acts or practices declared unlawful \n\n\t14\n\tby section 3 of this act.\n\n\t15\n\tSection 2. This act shall apply to all causes of act...",2019-11-18 00:14:39.692329+00:00
1,1015166,48648,A03243,A03243,Establishes it shall be unlawful for a person to have his or her application to rent or lease a residence to be denied due to a previous housing court proceeding; allows a person aggrieved to maintain a civil action.,"To protect tenants from discrimination based on prior landlord-tenant litigation, or tenant screening reports, when applying for new housing.",NY,2009 General Assembly,bill_data/NY/2009-2010_General_Assembly/bill/A03243.json,1,2009-01-23,https://assembly.state.ny.us/leg/?default_fld=&bn=A03243&term=2009&Summary=Y&Actions=Y&Text=Y&Committee%26nbspVotes=Y&Floor%26nbspVotes=Y#A03243A,,"New York State Assembly | Bill Search and Legislative Information\n\n\n\n\n \n\n\n\n\n\n\n\n \n\t\n\t\n \n\n \n\n\n \n\n \n \n \t \n \n \n \n \n\n\n \n New York State\n\n Assembly\n\n Speaker Carl E. Heastie\n\n\n \n\n\n \n \n \n\n \n\n \n\n \n \n \n\n \n\n \n\n \n \n WATCH LIVE\n\n \n\n \n\n \n\n \n \n\n \n \n\n \n\n \n \n \n\n\n\tAssembly Members\n\tBill Search & \nLegislative Info\n\tStanding Committee Public Hearing Calendar\n\tSpeaker's \nPress Releases\n\tAssembly Reports\n\tCommittees, Commissions \n& Task Forces\n\n\n\n\n\n\n\n \n\n \n\n\t\n\t\n\n\t \n\t \n\t \n\t \n\n\n\n\n\tJavascript must be enabled to properly view this page.\n\n\t\n\n\n\n\nBill Search\nHome\nLaws\n \nLegislative\nCalendar\nPublic\nHearing Schedule\nAssembly\nCalendars\nAssembly\nCommittee Agenda\n\n\n\n\n\n\n\n\n\n\t\n\n\n\n\n\t\tBill No.: \n\t\t \n\n\t\t\n\n \t Summary \n\n \t Actions \n\n \t Floor&nbspVotes \n\n \t Memo \n\n \t Text \n\n\n\nA03243 Summary:\n\tBILL NO\tA03243A\n\t \n\tSAME AS\tSAME AS S03856-B\n\n\t \n\tSPONSOR\tO'Donnell (MS)\n\t \n\tCOSPNSR\tLopez V, Kellner, Alfano\n\t \n\tMLTSPNSR\tBarra, Clark, Glick, Rivera N\n\t \n\tAdd S235-g, RP L\n\t \n\tEstablishes it shall be unlawful for a person to have his or her application to rent or lease a residence to be denied due to a previous housing court proceeding; allows a person aggrieved to maintain a 
civil action. \n\nGo to top \nA03243 Actions:\n\tBILL NO\tA03243A\n\t \n\t01/23/2009\treferred to judiciary\n\t01/06/2010\treferred to judiciary\n\t06/15/2010\tamend and recommit to judiciary\n\t06/15/2010\tprint number 3243a\n\nGo to top\nA03243 Floor&nbspVotes:\nThere are no votes for this bill in this legislative session.\nGo to top\nA03243 Text:\n\n\n\n\n\n \n STATE OF NEW YORK\n ________________________________________________________________________\n \n 3243--A\n \n 2009-2010 Regular Sessions\n \n IN ASSEMBLY\n \n January 23, 2009\n ___________\n \n Introduced by M. of A. O'DONNELL, V. LOPEZ, KELLNER, ALFANO -- Multi-\n Sponsored by -- M. of A. BARRA, CLARK, GLICK, N. RIVERA -- read once\n and referred to the Committee on Judiciary -- recommitted to the\n Committee on Judiciary in accordance with Assembly Ru...",2019-11-17 23:51:02.294869+00:00
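
Formatting the list of IDs straight into the SQL string works here because the IDs are integers that came back from Solr, but bound parameters are the more general habit. Below is a minimal sketch of the idea using Python's built-in `sqlite3` as a stand-in for the PostgreSQL database. The toy table and the `fetch_bills` helper are made up for illustration, and the `IN (...)` form replaces the `ANY(ARRAY...)` syntax used above.

```python
import sqlite3

def fetch_bills(conn, bill_ids):
    """Select rows whose bill_id is in bill_ids, using one placeholder per id."""
    ids = [int(b) for b in bill_ids]  # also converts numpy integers to plain ints
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    query = (
        "SELECT bill_id, title FROM bills "
        f"WHERE bill_id IN ({placeholders}) ORDER BY bill_id"
    )
    return conn.execute(query, ids).fetchall()

# Toy in-memory table standing in for the real bills table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (bill_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO bills VALUES (?, ?)",
    [(1, "Bill one"), (2, "Bill two"), (3, "Bill three")],
)

rows = fetch_bills(conn, [3, 1])
print(rows)
```

Only the placeholders are interpolated into the query text; the values themselves travel separately to the database driver.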


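## Build vectorizer on input text

We fit a binary `CountVectorizer` on the model bill alone, so its vocabulary is every six-word phrase that appears in that bill.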

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(6,6))

In [31]:
%%time
vectorizer.fit([target.content])

CPU times: user 10.5 ms, sys: 8.92 ms, total: 19.4 ms
Wall time: 80.3 ms


CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(6, 6), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [32]:
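# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
vectorizer.get_feature_names()[:10]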

['2005 reapproved by alec board of',
 '2013 amended by the alec board',
 '2014 this provision is needed only',
 '28 2013 amended by the alec',
 '500 per person whichever is greater',
 'absence of the unlawful act or',
 'accord with state or federal law',
 'act does not provide for statutory',
 'act for damages for an act',
 'act including whether the person took']
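
To see where these phrases come from, here is a rough pure-Python imitation of what the vectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, and slide a six-word window across them. `word_ngrams` is a hypothetical helper for illustration, not part of scikit-learn, and the sample sentence is made up.

```python
import re

# Same pattern as the token_pattern shown in the CountVectorizer repr above
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, n=6):
    """Return every run of n consecutive tokens, joined with spaces."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "Any person who suffers an ascertainable loss of money or property"
for gram in word_ngrams(sample):
    print(gram)
```

Single-character tokens such as "a" are dropped before the window slides, which is why a phrase can skip over them.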

In [33]:
matrix = vectorizer.transform(matches_df.content)
sums = matrix.sum(axis=1)
sums[:10]

matrix([[6],
        [0],
        [0],
        [3],
        [0],
        [1],
        [1],
        [5],
        [1],
        [0]])
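
Because the vectorizer is binary and was fit only on the model bill, each row sum is the number of distinct six-word phrases from the model bill that also appear in that candidate. The same number can be computed as a set intersection. This is a sketch with made-up text, and `ngram_set` and `shared_count` are illustrative names only.

```python
import re

def ngram_set(text, n=6):
    """Distinct n-word phrases in text, tokenized like CountVectorizer's default."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_count(model_text, bill_text, n=6):
    """How many distinct phrases from the model text also occur in the bill text."""
    return len(ngram_set(model_text, n) & ngram_set(bill_text, n))

model = "A person who suffers an ascertainable loss of money or property may bring an action"
bill = "Any consumer who suffers an ascertainable loss of money or property shall notify the seller"
unrelated = "This act takes effect on the first day of July following passage"

print(shared_count(model, bill))
print(shared_count(model, unrelated))
```

The two sample sentences share a nine-word run, and a nine-word run contains four overlapping six-word phrases, so even a short copied passage produces several matches.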

In [39]:
pd.DataFrame({
    'matches': np.squeeze(np.asarray(sums)),
    'bill_id': matches_df.bill_id,
    'title': matches_df.title,
    'code': matches_df.state + "-" + matches_df.bill_number
}).sort_values(by='matches', ascending=False).head(10)

Unnamed: 0,matches,bill_id,title,code
85,17,587843,Consumer protection.,IN-SB0394
182,16,1139721,"Telephone solicitation. Adds to the list of telephone calls that are exempt from the ""do not call"" statute any telephone call made to a consumer by a caller that: (1) is: (A) a communications service provider that offers broadband internet service; or (B) a financial institution or a person licensed by the department of financial institutions to engage in first lien mortgage transactions or consumer credit transactions; and (2) has an established business relationship with the consumer. Requires the consumer protection division of the attorney general's office (division) to notify Indiana residents of the following: (1) The prohibition under federal law against a person making any call using an: (A) automatic telephone dialing system; or (B) artificial or prerecorded voice; to any telephone number assigned to a mobile telecommunications service. (2) The prohibition under federal law against a person initiating any telephone call to any residential telephone line using an artificial or prerecorded voice to deliver a message without the prior consent of the called party. (3) Information concerning the placement of a telephone number on the National Do Not Call Registry operated by the Federal Trade Commission. Allows the division to use the consumer protection division telephone solicitation fund (fund) to: (1) administer the statutes concerning: (A) the registration of telephone solicitors; and (B) the regulation of automatic dialing machines; and (2) reimburse county prosecutors for expenses incurred in extraditing violators of these and other state and federal statutes concerning telephone solicitations. (Current law provides that the fund may be used only to administer: (1) the state's ""do not call"" statute; (2) the federal statute concerning restrictions on the use of telephone equipment; and (3) the state statute concerning misleading or inaccurate caller identification (caller ID statute).) 
Provides that certain civil penalties recovered by the attorney general for violations of the statutes concerning: (1) the registration of telephone solicitors; and (2) the regulation of automatic dialing machines; shall be deposited in the fund. Defines ""executive"" for purposes of the ""do not call"" statute, and provides that an executive of a person that violates the ""do not call"" statute commits a separate deceptive act actionable by the division. Provides that the attorney general can collect attorney fees and costs in a civil action for a violation of the caller ID statute. Amends the definition of ""seller"" for purposes of the statute requiring telephone solicitors to register with the division, so that the definition includes any person making a telephone solicitation. (Current law includes only persons making specified false representations in a telephone solicitation.) Provides that all sellers that make telephone solicitations must register with the division. (Under current law, registration is required only if the seller makes a solicitation ...",IN-HB1123
109,15,700217,Relating to civil actions filed under Consumer Protection Act,WV-SB315
146,11,930123,Office of Consumer Protection; clarify acts excluded from regulation of.,MS-HB1417
84,10,580874,Prices charged to retailers by suppliers.,IN-HB1068
143,10,917370,Mississippi Consumer Protection Act; revise.,MS-SB2404
98,10,669489,"Debt collection. Amends the statute concerning deceptive consumer sales as follows: (1) Defines the term ""debt buyer"". (2) Specifies that a debt buyer is a debt collector for purposes of the statute. (3) Requires a debt collector to make certain disclosures to an Indiana debtor. (4) Provides that the failure to make the required disclosures constitutes a deceptive act under the statute. (5) Specifies that the attorney general's authority to recover a civil penalty not exceeding $1,000 for knowing violations of the provisions concerning debt collection practices applies to each violation of the provisions per consumer, subject to a cap",IN-SB0211
180,9,1132105,Provides that a person who is injured by a product has 15 years after the sale or lease of the product to bring a suit for damages.,MO-HB186
175,9,1099535,"Modifies various provisions relating to civil procedure, tort claims, contingency fee contracts entered into by the state, unlawful merchandising practices, arbitration agreements between employers and employees, damages, and products liability",MO-SB1102
19,9,126000,An Act Relating To Commercial Law -- General Regulatory Provisions -- Deceptive Trade Practices (would Require That A Party Alleging An Unfair Or Deceptive Act Or Practice In The Conduct Of Trade Or Commerce File A Written Demand For Relief With The Alleged Actor At Least Thirty Days Prior To Filing A Lawsuit…..),RI-H7476


In [37]:
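# One row per candidate bill, one column per six-word phrase from the model bill
word_counts = pd.DataFrame(
    matrix.toarray(),
    columns=vectorizer.get_feature_names(),
    index=matches_df.state + "-" + matches_df.bill_number
)

# Drop bills that share no phrases with the model bill
word_counts = word_counts.loc[~(word_counts==0).all(axis=1)]

# Blank out zeros, then drop phrases and bills with no matches at all
word_counts = word_counts.replace(0, np.nan) \
    .dropna(axis=1, how='all') \
    .dropna(axis=0, how='all')

# Per-bill total: how many phrases each bill shares with the model bill
word_counts['TOTAL_ngrams_shared'] = word_counts.sum(axis=1)
word_counts = word_counts.sort_values(by='TOTAL_ngrams_shared', ascending=False)
# Flip the table so phrases are rows and bills are columns
word_counts = word_counts.T

# Per-phrase total: how many bills contain each phrase
word_counts['TOTAL_bills_used'] = word_counts.sum(axis=1)
word_counts = word_counts.sort_values(by='TOTAL_bills_used', ascending=False)

# Show empty cells instead of NaN
word_counts.fillna("", inplace=True)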

In [38]:
word_counts.head(200)

Unnamed: 0,IN-SB0394,IN-HB1123,WV-SB315,MS-HB1417,IN-SB0211,MS-SB2404,IN-HB1068,OR-SB314,RI-H7476,MO-SB1102,IN-SB0222,MO-HB186,IN-SB0320,IN-HB1405,OR-SB728,IN-HB1055,IN-HB1378,MO-SB489,MO-SB5,MO-SB487,PA-HB228,MT-SB281,PA-HB243,OR-SB976,MO-HB256,MO-HB2089,MO-HB2108,AL-SB270,MO-SB832,OK-SB666,OK-SB743,MO-HB714,MO-SB276,MO-SB62,MO-SB150,PA-HB402,PA-HB638,OK-HB1603,PA-HB2044,RI-H5689,PA-HB475,IL-SB1888,RI-S0493,OK-SB371,OK-SB371.1,OK-SB1226,OK-SB103,PA-SB1247,AL-SB1,WV-SB556,...,OH-SB13,TN-SB1522,TN-HB2008,IL-SB1228,TN-SB0250,TN-HB0182,HI-HB804,NJ-A715,NY-A05247,WV-SB134,NY-A01161,MO-HB676,NY-S00056,MO-HB552,MO-HB550,NY-S00435,NJ-S616,NJ-A4252,NY-A00312,WV-SB113,OH-SB174,NJ-S1537,NY-A00679,NJ-A3497,NY-S02407,NJ-S1033,CA-AB2782,MI-SB0050,HI-SB849,CA-AB2588,NY-S04364,NY-A06655,CA-ABX838,KY-HB84,IL-HB1219,NJ-A3333,NJ-S2293,NJ-S922,NY-S04243,TX-SB1628,NY-A09785,NJ-S1473,NJ-S905,NJ-A303,NJ-A2796,NJ-S1669,NJ-S2855,NJ-A4330,OR-HB2252,TOTAL_bills_used
TOTAL_ngrams_shared,17.0,16.0,15.0,11.0,10.0,10.0,10.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,589.0
ascertainable loss of money or property,,,1.0,1.0,,,,1.0,1.0,1.0,,1.0,,,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,,,1.0,1.0,1.0,1.0,,1.0,,,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,46.0
entitled to bring an action under,1.0,1.0,,,1.0,,1.0,,1.0,,1.0,,1.0,1.0,,1.0,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,1.0,,,,,,,,...,1.0,,,,,,,,1.0,,1.0,,1.0,,,1.0,,1.0,1.0,,1.0,1.0,,,,1.0,,,,1.0,,,1.0,,,,,1.0,,,1.0,1.0,1.0,,,,,,,37.0
an ascertainable loss of money or,,,1.0,,,,,1.0,,1.0,,1.0,,,1.0,,,1.0,1.0,1.0,,,,1.0,1.0,1.0,1.0,,1.0,,,1.0,1.0,1.0,1.0,,,,,,,,,1.0,1.0,1.0,1.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,35.0
act or practice declared unlawful by,,,1.0,,,,,1.0,1.0,1.0,,1.0,,,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,,1.0,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,33.0
suffers an ascertainable loss of money,,,1.0,,,,,1.0,,1.0,,1.0,,,1.0,,,1.0,1.0,1.0,,,,1.0,1.0,1.0,1.0,,1.0,,,1.0,1.0,1.0,1.0,,,,,,,,,1.0,1.0,1.0,1.0,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,30.0
clear and convincing evidence that the,,,,1.0,,1.0,,,,1.0,,1.0,,,,,,,,,,,,,,,,,,1.0,1.0,,,,,,,1.0,,,,,,,,,,,,,...,,1.0,1.0,,1.0,1.0,1.0,,,1.0,,1.0,,1.0,1.0,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,26.0
by clear and convincing evidence that,,,,,,,,,,1.0,,1.0,,,,,,,,,,,,,,,,,,1.0,1.0,,,,,,,1.0,,,,,,,,,,,,,...,,1.0,1.0,,1.0,1.0,1.0,,,1.0,,1.0,,1.0,1.0,,,,,1.0,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,25.0
or practice declared unlawful by section,,,1.0,,,,,,1.0,1.0,,1.0,,,,,,1.0,1.0,1.0,1.0,,1.0,,1.0,1.0,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,,1.0,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,21.0
fees and costs to prevailing plaintiff,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,1.0,,1.0,,1.0,,,1.0,,,1.0,,,,,,,,,,,,1.0,1.0,,,,,,,,,,,,,,,,,1.0,15.0
